Sparse Communication for Distributed Gradient Descent
Authors
Abstract
We make distributed stochastic gradient descent faster by exchanging sparse updates instead of dense updates. Gradient updates are positively skewed, as most updates are near zero, so we map the smallest 99% of updates (by absolute value) to zero and then exchange sparse matrices. This method can be combined with quantization to further improve compression. We explore different configurations and apply them to neural machine translation and MNIST image classification tasks. Most configurations work on MNIST, whereas some configurations reduce the convergence rate on the more complex translation task. Our experiments show a speed-up of up to 49% on MNIST and 22% on NMT without damaging the final accuracy or BLEU.
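As a rough illustration of the dropping step, here is a minimal NumPy sketch; the function name drop_smallest and the dense-array interface are illustrative, and the serialization and exchange of the resulting sparse matrices between workers is omitted:

```python
import numpy as np

def drop_smallest(grad, drop_ratio=0.99):
    """Zero out the drop_ratio fraction of entries with the smallest
    absolute value, keeping the large (informative) updates intact."""
    flat = np.abs(grad).ravel()
    k = int(flat.size * drop_ratio)
    if k == 0:
        return grad.copy()
    # k-th smallest absolute value; ties may drop slightly more than k entries.
    threshold = np.partition(flat, k - 1)[k - 1]
    return np.where(np.abs(grad) > threshold, grad, 0.0)

g = np.random.randn(1000, 1000)
sparse_g = drop_smallest(g)
print(f"nonzero fraction: {np.count_nonzero(sparse_g) / sparse_g.size:.4f}")
```

Only the surviving ~1% of entries need to cross the network, as (index, value) pairs or another sparse encoding, which is where the bandwidth saving comes from.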
Similar resources
Trading Computation for Communication: Distributed Stochastic Dual Coordinate Ascent
We present and study a distributed optimization algorithm that employs a stochastic dual coordinate ascent method. Stochastic dual coordinate ascent methods enjoy strong theoretical guarantees and often outperform stochastic gradient descent methods on regularized loss minimization problems, yet little effort has been made to study them in a distributed framework. We ...
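For concreteness, a minimal single-machine sketch of one SDCA epoch for ridge regression; the closed-form delta step is specific to the squared loss, and how data and dual variables are partitioned across machines in the distributed setting is beyond this sketch:

```python
import numpy as np

def sdca_ridge(X, y, lam=0.1, epochs=10, seed=0):
    """SDCA for min_w (1/n)*sum_i 0.5*(x_i.w - y_i)^2 + (lam/2)*||w||^2.
    Each step maximizes the dual over one coordinate in closed form and
    keeps the primal iterate w = (1/(lam*n))*sum_i alpha_i*x_i in sync."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):
            # Closed-form dual coordinate step for the squared loss.
            delta = (y[i] - X[i] @ w - alpha[i]) / (1.0 + (X[i] @ X[i]) / (lam * n))
            alpha[i] += delta
            w += (delta / (lam * n)) * X[i]
    return w
```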
Sparse Diffusion Steepest-Descent for One Bit Compressed Sensing in Wireless Sensor Networks
This letter proposes a sparse diffusion steepest-descent algorithm for one-bit compressed sensing in wireless sensor networks. The approach exploits the diffusion strategy from distributed learning in the one-bit compressed sensing framework. To estimate a common sparse vector cooperatively from only the sign of measurements, steepest descent is used to minimize the suitable global and local con...
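A toy sketch of the adapt-combine-sparsify pattern such diffusion algorithms follow; the fully connected network, the hinge-type sign-consistency loss, and the soft-thresholding step are this sketch's assumptions, not necessarily the letter's exact choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n_nodes, m, n = 4, 40, 60
x_true = np.zeros(n)
x_true[rng.choice(n, 5, replace=False)] = rng.standard_normal(5)

# Each sensor observes only the SIGN of its linear measurements.
A = [rng.standard_normal((m, n)) for _ in range(n_nodes)]
y = [np.sign(Ai @ x_true) for Ai in A]

x = [np.zeros(n) for _ in range(n_nodes)]
mu, lam = 0.02, 0.005  # step size and sparsity threshold
for _ in range(300):
    # Adapt: local steepest-descent step on the sign-consistency loss.
    psi = []
    for Ai, yi, xi in zip(A, y, x):
        wrong = (yi * (Ai @ xi)) < 0          # sign-violating measurements
        psi.append(xi + mu * (Ai.T @ (yi * wrong)))
    # Combine: diffuse intermediate estimates over the network.
    avg = sum(psi) / n_nodes
    # Sparsify: soft-threshold toward a common sparse estimate.
    x = [np.sign(avg) * np.maximum(np.abs(avg) - lam, 0.0) for _ in range(n_nodes)]
```

Note that sign-only measurements fix at most the direction of the vector, so recovery is up to scale.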
Preserving communication bandwidth with a gradient coding scheme
Large-scale machine learning involves the communication of gradients, and large models often saturate the communication bandwidth doing so. I implement an existing scheme, quantized stochastic gradient descent (QSGD), to reduce the communication bandwidth. This requires a distributed architecture, and we choose to implement a parameter server that uses the Message Passing Interfac...
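The heart of QSGD is a stochastic, unbiased quantizer; a small sketch under the usual s-level formulation (the parameter server and MPI plumbing are omitted, and the function name is illustrative):

```python
import numpy as np

def qsgd_quantize(v, s=4, rng=None):
    """Quantize each coordinate to its sign and one of s+1 levels of
    |v_i| / ||v||_2, rounding stochastically so that E[q(v)] = v."""
    rng = rng if rng is not None else np.random.default_rng()
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return np.zeros_like(v)
    scaled = np.abs(v) / norm * s           # position in [0, s]
    lower = np.floor(scaled)
    # Round up with probability equal to the fractional part.
    level = lower + (rng.random(v.shape) < (scaled - lower))
    return norm * np.sign(v) * level / s

g = np.random.randn(8)
print(g, qsgd_quantize(g), sep="\n")
```

Because the quantized gradient equals the original in expectation, standard SGD convergence arguments carry over, while each coordinate costs only a sign and a few level bits plus one shared norm.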
Network Newton–Part II: Convergence Rate and Implementation
The use of network Newton methods for the decentralized optimization of a sum cost distributed through agents of a network is considered. Network Newton methods reinterpret distributed gradient descent as a penalty method, observe that the corresponding Hessian is sparse, and approximate the Newton step by truncating a Taylor expansion of the inverse Hessian. Truncating the series at K terms yi...
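A sketch of the truncation idea in assumed notation: suppose the penalized objective's Hessian splits as H = D − B, with D block diagonal (locally computable) and B sparse, coupling only neighbors; when the spectral radius of X below is less than one, the inverse expands as a geometric series that can be cut off at K terms:

```latex
\[
  H^{-1} = D^{-1/2}\,(I - X)^{-1}\,D^{-1/2},
  \qquad X = D^{-1/2} B\, D^{-1/2},
\]
\[
  \hat{H}^{-1}_{(K)} = D^{-1/2} \sum_{k=0}^{K} X^{k}\, D^{-1/2},
  \qquad d^{(K)} = -\hat{H}^{-1}_{(K)}\, g .
\]
```

Each additional term reaches one hop further into the network, so larger K trades extra rounds of neighbor communication (and local computation) for a more accurate Newton direction.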
An Asynchronous Distributed Proximal Gradient Method for Composite Convex Optimization
We propose a distributed first-order augmented Lagrangian (DFAL) algorithm to minimize the sum of composite convex functions, where each term in the sum is a private cost function belonging to a node, and only nodes connected by an edge can directly communicate with each other. This optimization model abstracts a number of applications in distributed sensing and machine learning. We show that a...
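In assumed notation, the model is consensus optimization over a graph with edge set E, where each node i privately holds a smooth part f_i and a possibly nonsmooth part rho_i (e.g., an l1 regularizer):

```latex
\[
  \min_{x \in \mathbb{R}^{n}} \sum_{i=1}^{N} \bigl(\rho_i(x) + f_i(x)\bigr)
  \;\Longleftrightarrow\;
  \min_{x_1,\dots,x_N} \sum_{i=1}^{N} \bigl(\rho_i(x_i) + f_i(x_i)\bigr)
  \quad \text{s.t.}\; x_i = x_j \;\;\forall\,(i,j) \in \mathcal{E}.
\]
```

The edge constraints make the local copies agree without any node ever sharing its private cost, which is what confines communication to the graph's edges.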
Publication date: 2017